Job management with SLURM¶
You should not run your compute code directly on the terminal you find when you log in.
In order to submit a job on the DGX, you need to describe the resources you need (partition, MIGs, CPUs) to the task manager Slurm. The task manager will launch the job as soon as the requested resources are available.
You can run up to 4 jobs at a given time. All subsequent requests will be put on hold until one of your previous jobs is completed.
There are two ways to run a compute code on the DGX:
- using an interactive Slurm job: this will open an interactive session where you can execute your code. This method is well-suited for light tests and environment configuration (especially for GPU-accelerated codes). See the section Interactive jobs.
- using a Slurm script: this will submit your script to the scheduler, which will run it when the resources are available. This method is well-suited for "production" runs.
Slurm is configured with a "fairshare" policy among the users, which means that the more resources you have requested in the past days, the lower your priority will be when the task manager has several jobs to handle at the same time.
You can check your usage and fairshare information at any time with the command sshare -l.
You can also check the priority information of the pending jobs with sprio.
In addition to this page, which documents Slurm commands in the context of the DGX, you can check the Slurm workload manager documentation.
Slurm script¶
Most of the time, you will run your code through a Slurm script. This script has the following functions:
- specify the resources you need for your code: partition, type of MIG, how many tasks and CPUs per task;
- specify other parameters for your job (project which your job belongs to, output files, mail information on your job status, job name, etc.);
- setup the batch environment (load modules, set environment variables);
- run the code!
Running the code will depend on your executable. Parallel codes may have to use srun or have specific environment variables set.
Slurm directives¶
You describe the resources you need using sbatch directives (script lines beginning with #SBATCH). These specifications can either be passed directly as options to the sbatch command or listed in the submission script. Using a script is the best solution if you want to submit the job several times, or to submit several similar jobs.
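For instance, the same options can be given either on the command line at submission time or as directives inside the script (the option values below are just placeholders):
# On the command line at submission time:
$ sbatch --job-name=test --time=01:00:00 job.batch
# Or as directives inside job.batch:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --time=01:00:00
Options given on the command line take precedence over the corresponding #SBATCH directives in the script.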
Mandatory slurm directives on the DGX¶
partition¶
Mandatory: there is no default partition, so you must choose a partition according to the resources you need from the list of available partitions. Then specify the Slurm partition your job will be assigned to:
#SBATCH --partition=<PartitionName>
gres¶
Mandatory: specify which type of MIG you want, using generic resources (gres):
#SBATCH --gres=gpu:<Type>:<Quantity>
with <Type>:<Quantity> either 1g.10gb:1 for partition prod10, 2g.20gb:1 for prod20, 3g.40gb:1 for prod40, or A100.80gb:1 for prod80.
ntasks¶
Mandatory: specify the number of tasks (MPI processes):
#SBATCH --ntasks=<ntasks>
If you don't need to run commands in parallel, just ask for one task (#SBATCH --ntasks=1).
cpus-per-task¶
Mandatory: specify the number of threads per process (e.g. OpenMP threads per MPI process):
#SBATCH --cpus-per-task=<ntpt>
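Putting the four mandatory directives together, a minimal job header could look like the following sketch (the prod10 partition and the 1g.10gb MIG are only one possible choice):
#!/bin/bash
## Minimal set of mandatory directives (values chosen for illustration)
#SBATCH --partition=prod10       # partition matching the needed resources
#SBATCH --gres=gpu:1g.10gb:1     # one 1g.10gb MIG
#SBATCH --ntasks=1               # a single (non-MPI) task
#SBATCH --cpus-per-task=4        # 4 CPUs for this task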
Other SBATCH directives¶
error¶
Define the error output (stderr) for your job:
#SBATCH --error=</path/to/errorJob>
By default both standard output and standard error are directed to the same file.
export¶
Export user environment variables. By default, all user environment variables will be loaded (--export=ALL). To avoid dependencies and inconsistencies between the submission environment and the batch execution environment, disabling this functionality is highly recommended.
To avoid exporting the environment variables present at job submission time to the job's environment:
#SBATCH --export=NONE
To select explicitly exported variables from the caller's environment to the job environment:
#SBATCH --export=VAR1,VAR2
You can also assign values to these exported variables, for example:
#SBATCH --export=VAR1=10,VAR2=18
job-name¶
Define the job's name:
#SBATCH --job-name=<jobName>
mail-type¶
To be notified by mail (at the address defined by mail-user) when a step has been reached:
#SBATCH --mail-type=ALL
The arguments for the --mail-type option are:
- BEGIN: send an email when the job starts
- END: send an email when the job stops
- FAIL: send an email if the job fails
- ALL: equivalent to BEGIN, END, FAIL.
mail-user¶
Set an email address (useful to be notified according to the option chosen with mail-type):
#SBATCH --mail-user=firstname.lastname@mywebserver.com
If used, this must not be empty.
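Slurm also accepts a comma-separated list of values for --mail-type. For example, to be notified only when the job ends or fails (the address is a placeholder):
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=firstname.lastname@mywebserver.com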
output¶
Define the standard output (stdout) for your job:
#SBATCH --output=</path/to/outputJob>
The default is --output=slurm-%j.out (the %j in the filename will be replaced by the job ID automatically at file creation).
If you need to direct the stdout to a specific directory, you must first create the directory, say logs, and then set the option as --output=logs/slurm-%j.out.
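For example, to separate stdout and stderr in a logs directory (a sketch; the directory must exist before the job starts):
# Create the directory once, before submitting:
$ mkdir -p logs
# Then, in the batch script:
#SBATCH --output=logs/slurm-%j.out
#SBATCH --error=logs/slurm-%j.err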
propagate¶
By default, all resource limits (those reported by the ulimit command: stack size, open files, number of processes, ...) are propagated (--propagate=ALL). To avoid propagating the interactive limits and overriding the batch resource limits, disabling this functionality is encouraged:
#SBATCH --propagate=NONE
time¶
Specify the walltime for your job (within the Max Walltime of the partition). If your job is still running after the walltime duration, it will be killed:
#SBATCH --time=<hh:mm:ss>
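For example, to request a walltime of two hours:
#SBATCH --time=02:00:00
Slurm also understands a days-hours:minutes:seconds form, e.g. --time=1-12:00:00 for a day and a half.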
Submit and monitor jobs¶
Example batch file templates to execute a main.py file¶
A fairly general template of a script job.batch which runs a main.py file, including all mandatory directives (partition, gres, ntasks and cpus-per-task):
#!/bin/bash
#
#SBATCH --job-name=job
#SBATCH --output /path/to/slurm-%j.out
#SBATCH --error /path/to/slurm-%j.err
## For partition: either prod10, prod20, prod40 or prod80
#SBATCH --partition=prod10
## For gres: either 1g.10gb:1 for prod10, 2g.20gb:1 for prod20, 3g.40gb:1 for prod40 or A100.80gb:1 for prod80.
#SBATCH --gres=gpu:1g.10gb:1
## For ntasks and cpus: the total requested CPUs (ntasks * cpus-per-task) must be in [1 : 4 * nMIG], with nMIG = VRAM / 10 (i.e. nMIG = 1, 2, 4, 8 for 1g.10gb, 2g.20gb, 3g.40gb, A100.80gb).
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
## Perform run
python3 /path/to/main.py
Another one using a virtual environment, a logslurms directory (for the output and error files) and a working_directory (containing the main.py) in the user's home:
#!/bin/bash
#
#SBATCH --job-name=job
#SBATCH --output=~/logslurms/slurm-%j.out
#SBATCH --error=~/logslurms/slurm-%j.err
## For partition: either prod10, prod20, prod40 or prod80
#SBATCH --partition=prod10
## For gres: either 1g.10gb:1 for prod10, 2g.20gb:1 for prod20, 3g.40gb:1 for prod40 or A100.80gb:1 for prod80.
#SBATCH --gres=gpu:1g.10gb:1
## For ntasks and cpus: the total requested CPUs (ntasks * cpus-per-task) must be in [1 : 4 * nMIG], with nMIG = VRAM / 10 (i.e. nMIG = 1, 2, 4, 8 for 1g.10gb, 2g.20gb, 3g.40gb, A100.80gb).
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
## Virtual environment
source ~/env/bin/activate
## Perform run
CUDA_VISIBLE_DEVICES=1 time python ~/working_directory/main.py
In both examples, standard output (stdout) will be in the slurm-%j.out file (the %j will be replaced by the job ID automatically) and standard error (stderr) will be in the slurm-%j.err file.
submit job¶
You need to submit your script job.batch with:
$ sbatch /path/to/job.batch
Submitted batch job 29509
which responds with the JobID assigned to the job; here, the JobID is 29509. The JobID is a unique identifier that is used by many Slurm commands.
monitor job¶
The squeue command shows the list of jobs:
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
29509 prod10 job username R 0:02 1 dgxa100
You can change the default format through the SQUEUE_FORMAT variable, for example by adding the following to your .bash_profile:
export SQUEUE_FORMAT="%.18i %.14P %.8j %.10u %.2t %.10M %20b %R"
This replaces the NODES information (always 1, since there is only the DGX) with the MIG required by the job (column TRES_PER_NODE):
JOBID PARTITION NAME USER ST TIME TRES_PER_NODE NODELIST(REASON)
For more squeue format options, see the squeue documentation.
cancel job¶
The scancel command cancels jobs.
To cancel the job with JobID 29509 (obtained at submission or through squeue), you would use:
$ scancel 29509
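scancel also accepts filters. For instance, to cancel all of your own pending and running jobs at once:
$ scancel --user=$USER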
interactive jobs¶
For an interactive session:
- the partition must be interactive10;
- the reserved MIG must be a single 1g.10gb;
- the total CPUs requested (ntasks * cpus-per-task) must not exceed 4;
- example:
$ srun --partition=interactive10 --gres=gpu:1g.10gb:1 --ntasks=1 --cpus-per-task=4 --pty /bin/bash
$ squeue
JOBID PARTITION NAME USER ST TIME TRES_PER_NODE CPUS MIN_MEMORY NODELIST(REASON)
462 interactive10 bash username R 0:05 gres:gpu:1g.10gb:1 4 4000M dgxa100
$ nvidia-smi
Thu Jul 13 13:01:11 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... Off | 00000000:01:00.0 Off | On |
| N/A 48C P0 57W / 275W | 45MiB / 81920MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... Off | 00000000:47:00.0 Off | On |
| N/A 48C P0 65W / 275W | 45MiB / 81920MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... Off | 00000000:81:00.0 Off | On |
| N/A 48C P0 58W / 275W | 45MiB / 81920MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA DGX Display Off | 00000000:C1:00.0 Off | N/A |
| 34% 44C P8 N/A / 50W | 1MiB / 4096MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM... Off | 00000000:C2:00.0 Off | On |
| N/A 47C P0 56W / 275W | 48MiB / 81920MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 7 0 0 | 6MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
To close the session, use the exit command.
job arrays¶
Job arrays are only supported for batch jobs, and the array index values are specified using the --array or -a option of the sbatch command. The option argument can be specific array index values, a range of index values, or a range with a step size, as shown in the examples below. Jobs which are part of a job array will have the environment variable SLURM_ARRAY_TASK_ID set to their array index value.
# Submit a job array with index values between 0 and 31
$ sbatch --array=0-31 job
# Submit a job array with index values of 1, 3, 5 and 7
$ sbatch --array=1,3,5,7 job
# Submit a job array with index values between 1 and 7
# with a step size of 2 (i.e. 1, 3, 5 and 7)
$ sbatch --array=1-7:2 job
The sub-jobs should not depend on each other: Slurm may start them in any order, possibly at the same time.
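As a sketch, a job array script can use SLURM_ARRAY_TASK_ID to differentiate the sub-jobs (the paths and the way the index is passed to main.py are assumptions for illustration):
#!/bin/bash
#SBATCH --job-name=array_job
#SBATCH --output=slurm-%A_%a.out    # %A = array job ID, %a = array index
#SBATCH --partition=prod10
#SBATCH --gres=gpu:1g.10gb:1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
## Each sub-job sees its own index in SLURM_ARRAY_TASK_ID
python3 /path/to/main.py --index "${SLURM_ARRAY_TASK_ID}"
Such a script would be submitted, for example, with sbatch --array=0-31 job.batch.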
chain jobs¶
If you want to submit a job which must be executed after another job, you can use Slurm's job dependency ("chain") mechanism.
$ sbatch slurm_script1.sh
Submitted batch job 74698
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
74698 ******* ******* username PD 0:00 * *******
$ sbatch --dependency=afterok:74698 slurm_script2.sh
Submitted batch job 74699
$ sbatch --dependency=afterok:74698:74699 slurm_script3.sh
Submitted batch job 74700
Note that if one of the jobs in the sequence fails, the following jobs remain pending by default with the reason "DependencyNeverSatisfied" and can never be executed. You must then delete them using the scancel command.
If you want these jobs to be automatically cancelled on failure, specify the --kill-on-invalid-dep=yes option when submitting them.
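For example, resubmitting the second job of the chain above so that it is cancelled automatically if its dependency fails:
$ sbatch --dependency=afterok:74698 --kill-on-invalid-dep=yes slurm_script2.sh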
Here are the common chaining rules:
- after:<jobID> = job can start once job <jobID> has started execution
- afterany:<jobID> = job can start once job <jobID> has terminated
- afterok:<jobID> = job can start once job <jobID> has terminated successfully
- afternotok:<jobID> = job can start once job <jobID> has terminated upon failure
- singleton = job can start once any previous job with identical name and user has terminated
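For instance, singleton can be used to ensure that only one job with a given name runs at a time for your user (the job name nightly_run is just an illustration):
$ sbatch --dependency=singleton --job-name=nightly_run job.batch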
Accounting¶
Use the command sacct to get info on your finished jobs, and sacct -j JobID for a specific job with ID JobID.
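sacct also accepts a --format option to choose the reported fields. For example, for the job submitted earlier (the field selection is just an illustration):
$ sacct -j 29509 --format=JobID,JobName,Partition,State,Elapsed,MaxRSS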